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Abstract 

We consider distributed optimization problems where networked nodes 
cooperatively minimize the sum of their locally known convex costs. A 
popular class of methods to solve these problems are the distributed gra¬ 
dient methods, which are attractive due to their inexpensive iterations, 
but have a drawback of slow convergence rates. This motivates the incor¬ 
poration of second order information in the distributed methods, but this 
task is challenging: although the Hessians which arise in the algorithm 
design respect the sparsity of the network, their inverses are dense, hence 
rendering distributed implementations difficult. We overcome this chal¬ 
lenge and propose a class of distributed Newton-like methods, which we 
refer to as Distributed Quasi Newton (DQN). The DQN family approx¬ 
imates the Hessian inverse by: 1) splitting the Hessian into its diagonal 
and off-diagonal part, 2) inverting the diagonal part, and 3) approximat¬ 
ing the inverse of the off-diagonal part through a weighted linear function. 

The approximation is parameterized by the tuning variables which corre¬ 
spond to different splittings of the Hessian and by different weightings of 
the off-diagonal Hessian part. Specific choices of the tuning variables give 
rise to different variants of the proposed general DQN method - dubbed 
DQN-0, DQN-1 and DQN-2 - which mutually trade-off communication 
and computational costs for convergence. Simulations demonstrate the 
effectiveness of the proposed DQN methods. 

Key words: Distributed optimization, second order methods, Newton¬ 
like methods, linear convergence. 
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1 Introduction 


We consider a connected network with n nodes, each of which has access to a 
local cost function fi : R p —► R, i = 1,... ,n. The objective for all nodes is to 
minimize the aggregate cost function / : —> R, defined by 

n 

f(y) = J2fi{y)- (!) 

i -1 

Problems of this form arise in many emerging applications like big data 
analytics, e.g., [7], distributed inference in sensor networks S3 na nn 0 , and 
distributed control, [ 29 ] . 

Various methods for solving 0 in a distributed manner are available in 
the literature. A class of methods based on gradient descent at each node 
and exchange of information between neighboring nodes is particularly popular, 
see S3 I3H 133 Ell II3 13 @3 133 133 ■ Assuming that the local costs fi ’s are 
strongly convex and have Lipschitz continuous gradients and that a constant 
step size a is used, these methods converge linearly to a solution neighborhood. 
With such methods, step size a controls the tradeoff between the convergence 
speed towards a solution neighborhood and the distance of the limit point from 
the actual solution, larger a means faster convergence but larger distance from 
the solution in the limit; see, e.g., ns], m- Distributed first order (gradient) 
methods allow for a penalty interpretation, where the distributed method is 
interpreted as a (centralized) gradient method applied on a carefully constructed 
penalty reformulation of the original problem |l]); see [15], [33] for details. 

Given the existence of well developed theory and efficient implementations 
of higher order methods in centralized optimization in general, there is a clear 
need to investigate the possibilities of employing higher order methods in dis¬ 
tributed optimization as well. More specifically, for additive cost structures 0 
we study here, a further motivation for developing distributed higher order 
methods comes from their previous success when applied to similar problems in 
the context of centralized optimization. For example, additive cost ([T]) is typical 
in machine learning applications where second order methods play an impor¬ 
tant role, see, e.g., [2133- Another similar class of problems arise in stochastic 
optimization, where the objective function is given in the form of mathematical 
expectation. Again, second order methods are successfully applied in centralized 

optimization, 03 [13 UM 133133- 

There have been several papers on distributed Newton-type methods. A 
distributed second order methods for network utility maximization and net¬ 
work flow optimization are developed in m and m but on problem formula¬ 
tions different from 0. Network Newton method [22] aims at solving 0 and 
presents a family of distributed (approximate) Newton methods. The class of 
Netwrok Newton method, refereed to as NN, is extensively analyzed in [23 131] • 
The proposed methods are based on the penalty interpretation, mm, of the 
distributed gradient method in [ 30 .;, and they approximate the Newton step 
through an £-th order Taylor approximation of the Hessian inverse, i = 0 , 1, ... 
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This approximation gives rise to different variants of methods within the family, 
dubbed NN-T Different choices of i exhibit inherent tradeoffs between the com¬ 
munication cost and the number of iterations until convergence, while NN-0, 1, 
and 2 are the most efficient in practice. The proposed methods converge linearly 
to a solution neighborhood, exhibit a kind of quadratic convergence phase, and 
show significantly better simulated performance when compared with the stan¬ 
dard distributed gradient method in EH]. Reference m proposes a distributed 
second order method which approximates the (possibly expensive) primal up¬ 
dates with the distributed alternating direction of multipliers method in |35j . 
In [35], the authors propose a distributed Newton Raphson method based on 
the consensus algorithm and separation of time-scales. Reference [25] proposes 
distributed second order methods based on the proximal method of multipli¬ 
ers (PMM). Specifically, the methods approximate the primal variable update 
step through a second order approximation of the NN-type [22]. While the NN 
methods in [22] converge to a solution neighborhood, the methods in [25] -27] 135] 
converge to exact solutions. 

In this paper, we extend [Hi m na and propose a different family of dis¬ 
tributed Newton-like methods for solving |l]). We refer to the proposed family 
as Distributed Quasi Newton methods (DQN). The methods are designed to 
exploit the specific structure of the penalty reformulation [151121], as is done 
in [22], but with a different Hessian inverse approximation, for which the idea 
originates in jl9| . Specifically, the Hessian matrix is approximated by its block 
diagonal part, while the remaining part of the Hessian is used to correct the 
right hand side of the quasi Newton equation. The methods exhibit linear con¬ 
vergence to a solution neighborhood under a set of standard assumptions for the 
functions /,; and the network architecture - each is strongly convex and has 
Lipschitz continuous gradient, and the underlying network is connected. Simu¬ 
lation examples on (strongly convex) quadratic and logistic losses demonstrate 
that DQN compares favorably with NN proposed in [22] , 

With the DQN family of methods, the approximation of the Newton step is 
parameterized by diagonal matrix at each iteration k, and different choices 
of Lfc give rise to different variants of DQN, which we refer to as DQN-0, 1, 
and 2. Different variants of DQN, based on different matrices L^, tradeoff the 
number of iterations and computational cost. In particular, setting L*, = 0 
yields the DQN-0 method; a constant, diagonal matrix L/ c = L corresponds to 
DQN-1. Finally, L*, with DQN-2 is obtained through approximately fitting the 
Newton equation (341 using a first order Taylor approximation. The DQN-1 
method utilizes the latter, DQN-2’s weight matrix at the first iteration, and 
then it “freezes” it to this constant value throughout the iterations; that is, it 
sets L = Lo, where Lo corresponds to the weight matrix of DQN-2 in the initial 
iteration. 

Let us further specify the main differences between the proposed DQN family 
and NN methods in [22] as the NN methods are used as the benchmark in 
the work presented here. First, the DQN methods introduce a different, more 
general splitting of Hessians with respect to NN, parameterized with a scalar 9 > 
0; in contrast, the splitting used in NN corresponds to setting 9 = 1. Second, 
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with the proposed variants of DQN-0, 1, and 2, we utilize diagonal matrices 
Lfc, while the NN-£ methods use in general block-diagonal or neighbor-sparse 
matrices. Third, DQN and NN utilize different inverse Hessian approximations; 
the NN’s inverse Hessian approximation matrix is symmetric, while with DQN 
it is not symmetric in general. Fourth, while NN approximates the inverse 
Hessian directly (independently of the Newton equation), DQN actually aims 
at approximating the Newton equation. Hence, unlike NN, the resulting DQN’s 
inverse Hessian approximation (with DQN-2 in particular) explicitly depends 
on the gradient at the current iterate, as it is the case with many Quasi-Newton 
methods; see, e.g., 0. Finally, the analysis here is very different from [ 221 . 
and the major reason comes from the fact that the Hessian approximation with 
DQN is asymmetric in general. This fact also incurs the need for a safeguarding 
step with DQN in general, as detailed in Section 3. We also point out that 
results presented in [22] show that NN methods exhibit a quadratic convergence 
phase. It is likely that similar results can be shown for certain variants of DQN 
methods as well, but detailed analysis is left for future work. It may be very 
challenging to rigorously compare the linear convergence rate factors of DQN 
and NN methods and their respective inverse Hessian approximations. However, 
we provide in Section 5 both certain analytical insights and extensive numerical 
experiments to compare the two classes of methods. 

As noted, DQN methods do not converge to the exact solution of (1), but 
they converge to a solution neighborhood, as it is the case with other methods 
(e.g., distributed gradient descent m and NN methods J2]) which are based 
on the penalty interpretation of (1). Hence, for very high accuracies, they may 
not be competitive with distributed second order methods which converge to the 
exact solution H3HE1I3E]- However, following the framework of embedding dis¬ 
tributed second order methods into PMM - developed in [25] , we apply here the 
DQN Newton direction approximations to the PMM methods; we refer to the 
resulting methods as PMM-DQN-f', £ = 0,1, 2. Simulation examples on strongly 
convex quadratic costs demonstrate that the PMM-DQN methods compare fa¬ 
vorably with the methods in [271HS1 EH] - Therefore, with respect to the existing 
literature and in particular with respect to [22] [25], this paper broadens the 
possibilities for distributed approximations of relevant Newton directions, and 
hence offers alternative distributed second order methods, which exhibit com¬ 
petitive performance on the considered simulation examples. Analytical studies 
of PMM-DQN are left for future work. 

This paper is organized as follows. In Section 2 we give the problem state¬ 
ment and some preliminaries needed for the definition of the method and con¬ 
vergence analysis. Section 3 contains the description of the proposed class of 
Newton-like methods and convergence results. Specific choices of the diagonal 
matrix that specifies the method completely are presented in Section 4. Some 
simulation results are presented in Section 5 while Section 6 discusses extensions 
of embedding DQN in the PMM framework. Finally, conclusions are drawn in 
Section 7, while Appendix provides some auxiliary derivations. 
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2 Preliminaries 


Let us first give some preliminaries about the problem ([I]), its penalty interpre¬ 
tation in puns, as well as the decentralized gradient descent algorithm in j30] 
that will be used later on. 

The following assumption on the /,’s is imposed. 

Assumption Al. The functions fi : R p —> R, i = 1, ... ,n are twice contin¬ 
uously differentiable, and there exist constants 0 < p < L < oo such that for 
every igR p 

pi < V 2 fi(x) d LI. 

Here, I denotes the p x p identity matrix, and notation M < N means that the 
matrix N — M is positive semi-definite. 

This assumption implies that the functions /), i = 1 are strongly 

convex with modulus p > 0, 

fi(z) > fdy) + Vfi{y) T {z-y) + ^\\z-y\\ 2 , y,zeW, ( 2 ) 

and the gradients are Lipschitz continuous with the constant L i.e. 

II V/i(y) - Vfi(z) || < L\\y-z\\, y,z£W, i = l,...,n. (3) 

Assume that the network of nodes is an undirected network Q = (V,£), 
where V is the set of nodes and £ is the set of all edges, i.e., all pairs {i,j} of 
nodes which can exchange information through a communication link. 
Assumption A2. The network Q = (V,£) is connected, undirected and simple 
(no self-loops nor multiple links). 

Let us denote by Oj the set of nodes that are connected with the node i 
(open neighborhood of node i) and let Oi = Oi IJ{*} (closed neighborhood of 
node i). We associate with Q a symmetric, (doubly) stochastic nxn matrix W. 
The elements of W are all nonnegative and rows (and columns) sum up to one. 
More precisely, we assume the following. 

Assumption A3. The matrix W = W T £ R nx " is stochastic with elements 
Wij such that 


Wij > 0 if {i, j} G £, Wij = 0 if {i, j} ^ £, i ^ j, and wu 


1 - Y, Wi o 

j&Oi 


and there are constants and w max such that for i = 1 ,,n 

0 ^ W m i n £ Wa ft Wmax 1. 

Denote by Ai > ... > A„ the eigenvalues of W. Then it can be easily seen 
that Ai = 1. Furthermore, the null space of I — W is spanned by e := (1,..., 1). 

Following ng, m , the auxiliary function <f> : R" p —»• M, and the correspond¬ 
ing penalty reformulation of (JT|) is introduced as follows. Let x = {x \,..., x n ) G 
R"P with Xi G R p , and denote by Z G U. npxnp the matrix obtained as the 
Kronecker product of W and the identity I G R pxp ,Z = W <8> I. 
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The corresponding penalty reformulation of ([I]) is given by 

" 1 

mm $(x) := fi(xi) +-x T (I-Z)x. (4) 

rcGM 71 * 3 z —' 2 

i =1 

Applying the standard gradient method to (|4]) with the unit step size we get 

x k+1 =x k -V<f>(x k ),k = 0,1,..., (5) 

which, denoting the i-th p x 1 block of x k by x k . and after rearranging terms, 
yields the Decentralized Gradient Descent (DGD) method [50] 

x i +1 = w v x d ~ aV /i(^ fe ), * = 1, • •., n. (6) 

j&Oi 

Clearly, the penalty parameter a influences the relation between 0 and 0 
- a smaller a means better agreement between the problems but also implies 
smaller steps in Q and thus makes the convergence slower. It can be shown, 
[22] that if y £ R p is the solution of ([T]) and x* := (yi,... , y„) £ is the 
solution of 0 then, for all i, 

\\Vi-y\\ = 0 (ol). (7) 

The convergence of 0 towards x* is linear, i.e., the following estimate holds 

uni m, 

$(x fc ) - $(**) < (1 - £) fc ($(z°) - $(**)), (8) 

where £ £ (0,1) is a constant depending on d>, a and W. 

The matrix and vector norms that will be frequently used in the sequel 
are defined here. Let ||a ||2 denote the Euclidean norm of vector a of arbitrary 
dimension. Next, ||^4 .||2 denotes the spectral norm of matrix A of arbitrary 
dimension. Further, for a matrix M £ ^npxnp w itb blocks £ R pxp , we will 
also use the following block norm: 


||M|| := max II M ijh, 

where, as noted, ||My ||2 is the spectral norm of Mij. For a vector x £ R np with 
blocks Xi £ R p , the following block norm is also used: 

n 

INI : = INI 2 - 

i =1 

3 Distributed Quasi Newton method 

In this section we introduce a class of Quasi Newton methods for solving Q. 
The general Distributed Quasi Newton (DQN) method is proposed in Subsection 
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3.1. The method is characterized by a generic diagonal matrix L^. Global linear 
convergence rate for the class is established in Subsection 3.2, while local linear 
convergence rate with the full step size e = 1 is analyzed in Subsection 3.3. 
Specific variants DQN-0, 1, and 2, which correspond to the specific choices of 
Lfc, are studied in Section 4. As we will see, algorithm DQN has certain tuning 
parameters, including the step size e. Discussion on the tuning parameters 
choice is relegated to Section 4. 

3.1 The proposed general DQN method 

The problem we consider from now on is 0 , where we recall Z = W®I and 
W satisfies assumption A3. 

The problem under consideration has a specific structure as the Hessian 
is sparse if the underlying network is sparse. However its inverse is dense. 
Furthermore, the matrix inversion (i.e. linear system solving) is not suitable 
for decentralized computation. One possibility of exploiting the structure of 
V 2 4>(:r) in a distributed environment is presented in 55] where the Newton 
step is approximated through a number of inner iterations. We present here a 
different possibility. Namely, we keep the diagonal part of V 2 <f>(x) as the Hessian 
approximation but at the same time use the non-diagonal part of V 2< f>(x) to 
correct the right hand side vector in the (Quasi)-Newton equation. Let us define 
the splitting 

W d = diag(W ) and W U = W- W d , 
and Z = 7L d + Z u with 

Z d = W d <8> I = diag(Z) and Z u = W u ® I. 

Here, diag(W ) denotes the diagonal matrix with the same diagonal as the matrix 
W. Hence, matrix Z d is a np x np diagonal matrix whose i -th px p block is the 
scalar matrix w^I , Z„ is a np x np block (symmetric) matrix such that ( [i,j)~ 
th p x p off-diagonal blocks are again scalar matrices WijI, while the diagonal 
blocks are all equal to zero. 

Clearly, the gradient is 

V4>(x) = aVF(x) + (I — Z)x, 
where I denotes the np x np identity matrix and 

n 

F(x) = VF(x) = (V/ 1 (x 1 ),... , Vf n (x n )), 

i =1 


while the Hessian is 

V 2 $(x) = «V 2 F(x) + I - Z 

where V 2 F(x) is the block diagonal matrix with the ith diagonal block V 2 /,;(xi). 

The general Distributed Quasi Newton, DQN, algorithm is presented below. 
Denote by k the iteration counter, k = 0,1,..., and let x k = (x\, —,x k ) € M np 
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be the estimate of x* at iteration k. Consider the following splitting of the 
Hessian 

V 2 $(x fc ) = A fc -G, (9) 

with 

A k =aV 2 F(x k ) + (l + 9)(I-Z d ) (10) 

and 

<G = Z U +9(I- Z d ) 

for some 9 > 0. Hence, G is a np x np block (symmetric) matrix whose ?’-th 

px p diagonal block equals ga /, with ga := 6 (1 — wu), while the (i, j)-th p x p 

off-diagonal block equals g%jl, with g tJ := Wij. One can easily see that the 
splitting above recovers the splitting for NN methods [22] taking 9 = 1. We 
keep 9 unspecified for now and later on, we will demonstrate numerically that 
taking 9 = 0 can be beneficial. Also, notice that A k is block diagonal with the 
ith diagonal block 

A k = ciS7 2 fi(x k ) + (1 + 9){ 1 - w u )I. 

Let Lfc £ K' n P xn P be a diagonal matrix composed of diagonal px p matrices 
A?, * = !>••• , n. In this paper, we adopt the following approximation of the 
Newton direction s^ = —(A k — G)~ 1 'V^(x k ): 

s k = -(I-L k G)A^ 1 V^(x k ). (11) 

The motivation for this approximation comes from [19] and the following rea¬ 
soning. Keep the Hessian approximation on the left hand side of the Newton 
equation (A*, — <G)s^ = — V$(a: fc ) diagonal, and correct the right hand side 
through the off-diagonal Hessian part. In more detail, the Newton equation can 
be equivalently written as: 

A fe 4 = -V$(x fe )+G4, (12) 

where the off-diagonal part s k := G s%- is moved to the right hand side. If we 
pretended for a moment to know the value of s fc , then the Newton direction is 
obtained as: 

s% = - (Afe)- 1 (V$(x fc ) - 3*) . (13) 

This form is suitable for distributed implementation due to the need to invert 
only the block diagonal matrix A k . However, s k is clearly not known, and hence 
we approximate it. To this end, note that, assuming that GV ( f > (a: /c ) is a vector 
with all the entries being non-zero, without loss of generality, s* can be written 
as follows: 

s k =L t GV$(i t ), (14) 

where L*, is a diagonal matrix. Therefore, estimating s* translates into estimat¬ 
ing the diagonal matrix L*,, assuming that GV c f > (a; fe ) is known. We follow [T91 
and consider simple approximations of the “true” L*,. E.g., we will consider 
Lfc = 0 which discards the off-diagonal Hessian part. Also, as we will see ahead, 


we adopt a Taylor approximation method for estimating L*,. Now, substituting 


(14) into (13), we obtain the following Newton direction approximation: 


Sn = - ' 


^ fc )- 1 (I-L fc G)V$(cr fc ). 


(15) 


Finally, we adopt as the definite form of the Newton direction approxima¬ 
tion, i.e., we permute the matrices (I — LfcG) and (A*,) 1 . The reason is that 
both s k and s § have the same inner product with V<f>(:r fc ), and hence they have 
the same descent properties (for example, Theorem 3.2 ahead holds unchanged 
for s k and s k ), while a careful inspection shows that s k allows for a cheaper (in 
terms of communications per iteration) distributed implementation. 


The reasoning above justifies restriction to diagonal La-’s, i.e., in view of (14), 
adopting diagonal L^.’s does not in principle, i.e., in structure sacrifice the qual¬ 
ity of the Newton direction approximation, while it is computationally cheap. 
Then, following a typical Quasi-Newton scheme, the next iteration is defined 

by 

x k+1 = x k +es k , (16) 


for some step size e. 

Clearly, the choice of La, is crucial in the approximation of (V 2 4>(x fc )) _1 . 
The following general algorithm assumes only that La, is diagonal and bounded. 
Specific choices of La, will be discussed in Section 4. Actually, all the proposed 
variants DQN-0, 1, and 2 utilize diagonal matrices L*,. Parameter 6 affects 
splitting © and the search direction For this moment we are assuming 

only that 0 is nonnegative and fixed initially, while further details are presented 
later on. 

In summary, the proposed distributed algorithm for solving Q is given be¬ 
low. 

Algorithm 1: DQN in vector format 

Given a: 0 € R np , s, p > 0. Set k = 0. 


Step 1. Chose a diagonal matrix La, G R n P xn P such that 


l|Lfc|| < p- 


Step 2. Set 

s k = -(I - LA,G)A^ 1 V$(s fe ). 

Step 3. Set 

x k + 1 =x k +es k , k = k+ 1. 

For the sake of clarity, the proposed algorithm, from the perspective of each 
node i in the network, is presented in Algorithm 2. 

Algorithm 2: DQN distributed implementation 

At each node i, require p, e > 0. 

1 Initialization: Each node i sets k = 0 and :r° G R p . 
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2 Each node i transmits x k to all its neighbors j £ Oi and receives x k from 
all j £ Oi . 

3 Each node i calculates 




»V/j(lj) + W ij ( X i - ZjO 

j&Oi 


4 Each node * transmits df to all its neighbors j £ Oi and receives d ] ,■ from 
all j £ 0 {. 

5 Each node i chooses a diagonal p x p matrix A*, such that ||A*|| 2 < P- 

6 Each node i calculates: 

j'eOi 

7 Each node z updates its solution estimate as: 


8 Set k = k + 1 and go to step 3. 

Calculation of A* in step 6 will be specified in the next section, and, for 
certain algorithm variants, will involve an additional inter-neighbor communi¬ 
cation of a p-dimensional vector. Likewise, choices of tuning parameters e, p , 9 
are discussed throughout the remaining of this section and Section 4. 

Remark. It is useful to compare (IT] ) with the direction adopted in NN meth¬ 
ods. Setting 9=1 and L& = 0 recovers NN-0, 9 = 1 and L* = — A^T 1 recovers 
NN-1, while NN-2 can not be recovered in a similar fashion. Hence, DQN in 
a sense generalizes NN-0 and NN-1. An approximation = L, that will be 
considered later on, does not recover any of NN methods. 

Observe that the approximation matrix (I — LfcG)A^' 1 is not symmetric in 
general. This fact induces the need for a safeguarding step in Algorithm 1, 
more precisely the elements of Lare uniformly bounded as stated in Step 1 
(Algorithm 1), ||L|| < p, and the resulting method requires an analysis different 
from [22] . 


3.2 Global linear convergence rate 

In this subsection, the global linear convergence rate of Algorithm DQN is estab¬ 
lished. The convergence analysis consists of two parts. First, we demonstrate 
that s k is a descent direction. Then we determine a suitable interval for the 
step size e that ensures linear convergence of the iterative sequence. 

The following Gershgorin type theorem for block matrices is needed for the 
first part of convergence analysis. 
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Theorem 3.1. f TTj For any C £ f lpx " p partitioned into blocks Cij of size 
p x p, each eigenvalue /i o/C satisfies 


1 


i¥=i 


(17) 


for at least one i £ {1,..., n}. 

Using the above theorem we can prove the following lower bound for all 
eigenvalues of a symmetric block matrix. 

Corollary 3.1. Let C £ R n P xn P be a symmetric matrix partitioned into blocks 
Cij of size p x p. Then each eigenvalue p of C satisfies 


p ^ min < A m i n (Cj.j) \ ' ||Uij ||2 / ? 

i=l,....n I z —' 

l J 

where X m i n (Ca) is the smallest eigenvalue of Ca. 

Proof. Given that C is symmetric, all its eigenvalues are real. Also, Ca 
is symmetric and has only real eigenvalues. Now, fix one eigenvalue p of the 
matrix C. By Theorem 3.1 there exists i £ {l,...,n}, such that holds. 
Next, we have 


II (Cii — pI)~ L || 2 = 


|A j{Cu) - p |’ 


where A j{Cu) is the j-th eigenvalue of Ca. Thus 

min \\j{Cii) - p\ < EllC'b lb- 

j=i ,—,p rrf 

We have just concluded that, for any eigenvalue p of C there exists * £ {1, 
and j £ {1,... ,p} such that p lies in the interval 

[A- E \\Ciih,\j{Cu) +E ll^lla]- 


i^l 




Hence, for each p for which holds for some fixed i, we have 

M > A min (C' ii ) - E H^lh 

l^i 

and the statement follows. □ 

We are now ready to prove that the search direction (11) is descent. 
Theorem 3.2. Suppose that A1-A3 hold. Let 


0 < p< 


ap+ (1 + 0)(1 - Wmax) 


(1 — Wmin){ 1 + Q) \OtL + (1 + 0)( 1 — W m in ) 


-<5 


(18) 
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for some 6 £ (0,1 /{aL + (1 + 0)(1 — w m i„))). Then s k defined by (11) is a 
descent direction and satisfies 


V T $(:r k )s k < -5||V3>(a; fc )|||. 

Proof. Let us first show that s k is descent search direction. As 

V T $(x k )s k = -V T $(z fe )(I - L fc G)A^ 1 V$(a; fc ) ) 

s k is descent if n T (I — Lk < &)Af 1 v > 0 for arbitrary v £ R n?xnp . Given that 
(I —LfcG)A^ 1 is not symmetric in general, we know the above is true if and only 
if the matrix 

C k = ±((I - L,.G)Afc 1 + A* 1 (I - GL fc )) 

is positive definite. C k is symmetric and thus it should be positive definite if all 
of its eigenvalues are positive. The matrix C fc is partitioned in the blocks 

C k = (A?)" 1 - ±0(1 - w u )(A k (A*)" 1 + (A k )~ 1 A k ), i = 1,... ,n, 

Cf 0 = - l - Wl] (A k {A k )- 1 + (A k )~ 1 A k ), * ? j. 

Corollary |3.1| implies that 

A mm (C fc ) > . min (\ min (C k f) - £ \\C k \\ 2 ) 


The definition of A& implies 

(an + (1 + 0)(1 - wu))I A A k A {aL + (1 + 0)(1 - wu))I (19) 
so, 

{an + (1 + 0)(1 — w max ))I A A k A {aL + (1 + 9)( 1 — w m i n ))I 
and therefore, for every i = 1 ,..., n 

ll( ^ } 1,12 - an + {i + e){\-w max y 


Moreover, it follows 


A min{C k ) > 


aL + (1 + 0)(1 — w m i n ) 


9(1- Wu)p 

an + (1 + ^)(1 10max ) 


and 


\\C?jh < 


_ Wjjp _ 

an (1 + 0)(1 10max) 
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Now, 


A min(C k ) > 


> 

> 


, 1 0(1 - Wu)p 

min (-- 

i=l,...,n Q.L -\- (1 -|- $)(1 Wmin ) “1“ (1 H - ^)(1 Wmax ) 

^ OLH A (1 + 0)(1 - w max ) 

min , _1_ p(l - Wu)(l + 9) ■ 

*=l,...,n qA + (1 + 0)(1 — Wmin) a/1 A (1 + 0)(1 — W max ) 

_1_ (1 - w min )( 1 + 6>) 

O-L + (1 + 0)( 1 — 

'Wmin ) OLfl + (1 + 0)(1 — Wmax ) 

(5. (20) 


Since (5 > 0 we conclude that <C fc is positive definite. Moreover, v T C k v = 
■u T (I - L fc G)A“ V for any v G R npxrip and 


V T <S>(x k )s k = -V T $(a; fe )(I-L fe G)A^ 1 V$(a; fe ) 

= -V T $(x fe )C ,fe V$( x k ) 

< (5|| V<I>(cc fe ) |||. 

( 21 ) 


□ 

The next lemma corresponds to the standard property of descent direction 
methods that establish the relationship between the search vector and the gra¬ 
dient. 

Lemma 3.1. Suppose that A1-A3 hold. Then 

\\s k h<mHx k )h, 


where 


1+ P (1 + 0)(1- 

Wmin ) 

eyp A (1 -f- 9)(1 Wmax) 


Proof. For matrix A*,, there holds that: 


, —11 


I 2 < 


eyp A (1 A d)(l w m ax) 


( 22 ) 


(23) 


This can be shown similarly as the upper bound on ||(Af) 1 1 |2 below (19). 
Furthermore, 

||Gp < (1 + 0)(1 — w m i n ). (24) 

This is true as 

||q | 2 = \\Z u A9(l-Z d )\\ 2 <\\Z u \\ 2 A9\\I-Z d \\ 2 

— P — ^dp + $P ~ ^dp < (1 + 9)( 1 — W m i n ), 
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where we used Z u = Z — Z^ A I — Z^ A (1 — tu m i n )I. Therefore, we have: 



< 


||(I-L fc G)A- i || 2 ||V$(^)|| 2 
(1 + ||Lfc|| 2 ||<G|| 2 )||A 
1 + p(\ + d)(l - w. 


A 1 ! 


a/l + (1 + 8)( 1 - VJ m ax ) 


2 || V<l>(:r fc )|| 
| V <3>(a: fe ) || 


0 


2 

2- 


□ 

Let us now show that there exists a step size e > 0 such that the sequence 
{x k } generated by Algorithm DQN converges to the solution of 0. Notice 
that 0 has a unique solution, say x*. Assumption Al implies that V<f>(:r) is 
Lipschitz continuous as well, i.e., with L := aL + 2(1 — w m i n ), there holds 


||V$(a:) - V$(y)|| 2 < L\\x - y || 2 , x,y G M" p . (25) 


Furthermore, 

£ II* - *11 < <&(*) - $(**) < iliv^^lll (26) 

2 (i 

for jl = ay and all x £ R” p . The main convergence statement is given below. 
Theorem 3.3. A ssume that the conditions of Theorem\3.2\ are satisfied. Define 


£ = 


s 

J 2 L 


(27) 


with (3 given by (22). 
that 


Then Algorithm DQN generates a sequence {x k } such 


lim x k = x* 

k—foo 


and the convergence is at least linear with 

S 2 y 


3>{x k+1 ) - $(**) < 1- 


2L(3 2 


($(x k ) - $(**)), 


Proof. The Mean Value Theorem, Lipschitz property 


A = 0,1,.... 

of V<1>, Theorem 1 3.2 1 
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and Lemma [3. 1 1 yield 

$(x k+1 ) - $(**) = $(x fe +£:s fe )-$(x*) 

= $(:r fc ) + f V T $(a c k +t£s k )es k dt - §{x*)±£S7 T $(x k )s k 
Jo 

< $(x k ) - $(x*) + e [ \\V T $(x k + tes k )-\7 T <$>{x k )\\2\\s k \\ 2 dt 

Jo 


+ eV T $(x fe )s fc 


< $(x k )-<Z>(x*) + £ Lt£\\s k \\idt + eX7 1 <f>(x k )s k 

Jo 

= $(:r fc ) - $(s*) + ^£r 2 Z||s fc ||l + eV T $(a : k )s k 

< $(x fe ) - <&(**) + P 2 ^e 2 \\V^x k )\\ 2 2 - eS\\ V$(a: fc )||l 


= $(x fc ) - $(s*) + ^-e 2 - ed ||V 2 $(a; fe )|| 


‘l^( r Jz M|2 
2‘ 


(28) 


Define 


,, ^ 2 c 

4>{£) = —£ - £6. 


Then <j>(0) = 0, <f/(e) = Ze/3 2 — S and > 0. Thus, the minimizer of <j> is 

e* = 5/(/3 2 L) and 

= (29) 


2j3 2 L 


Now, (28) and (291 give 


S 2 


$(a; fc+i ) - $(®*) < $(x k ) - $(z*)-—1|V<L>(a; 


fc Ml 2 


2j3 2 L 


From (26), we also have 


and 


so 


<f>{x k ) - $(®*) < — II V<E>(o ; fe )||2 


b 2 b 2 u 

||V$0c*)lll <“(*(**) “*(**)) A 


2f3 2 L 


2 f3 2 L’ 


5 2 p 


$(x k+1 ) - < [l - ($(^) - $(**)). 

Given that p = ap < aL < L , we have p/L < 1. Moreover, 

1 1 


<5 < 


< 


< 


aL -(-(IT 0)(1 U'rn.in ) T (1 T $)(1 IXmax) 

1 + P(1 + 0)(1- 

Wmin ) 


a/r+ (1 + 0)(1 - u; m ax) 


= /3 
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and 


We conclude with 


£“1- 


5 2 fi 

2/3 2 L 


e 


( 0 , 1 ). 


4>(a; fc+1 ) _ $(a;*) < £ _ $(a:*)) 

and 

lim 4>(;r fe ) = <f>(a;*). 

k—> oo 

As $ £ C' 2 (K n ), the above limit also implies 

lim x k = x*. 

k—yoo 

□ 

The proof of the above theorem clearly shows that for any e £ (0, 5/(/3 2 L)\ 
algorithm DQN converges. However, taking e as large as possible implies larger 
steps and thus faster convergence. 


3.3 Local linear convergence 

We have proved global linear convergence for the specific step length e given 
in Theorem 3.3 However, local linear convergence can be obtained for the full 


step size, using the theory developed for Inexact Newton methods [8j. The step 
s k can be considered as an Inexact Newton step and we are able to estimate the 
residual in Newton equation as follows. 

Theorem 3.4. Suppose that A1-A3 hold. Let x k he such that V4>(a; fc ) ^ 0. 
Assume that s k is generated in Step 2 of Algorithm 1 with 

ap 


0 < p < 


(1+0)(1- ^min )(aL + 2(1 + <9)(1 ^rain) ) 


Then there exists t £ (0,1) such that 


|V4>(x fc ) + V 2 4>(ar)s fc || < f||V$(a; fc )||. 


Proof. First, notice that the interval for p is well defined. The definition of 
the search direction and the splitting of the Hessian yield 

V4>(:r fe )+V 2 4>(:r fc )s fc = (GA^ 1 +A fc L fc GAfc 1 -GL fe GA^ 1 )V$(a: fe ) := Q fe V$(a; fe ). 

Therefore, 

IIQfcll < IIga^h + ||GA^||M|A*|| + HGA^lHMiiqi. 


Moreover, 



< 


max(0(l-^)||(A J fc )- 1 || + £ «;«||(^)- 1 ||) 
3 ieo. 


max 

j 


+ Wjj 

ap. + (1 + 0)(1 — Wjj) 
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Recalling (19) and the fact that the expression above is decreasing with respect 
to Wjj , we get 

(30) 


-In < (l + fl)(l Wmjn) _ 

k ~ an + (1 + 9){l - Wmin) 


and there holds ||Afe|| < aL + (1 + 8)( 1 — Wmin)- Furthermore, (23), (24) and 
||L fe || < p imply 

IIQfcll < 7 + 7P(1 + ^)(1 — W-min ) + 7 p(aL + (1 + 0)(1 - 

= 7 + pj(aL + 2(1 + 0)(1 — 'LV min )) (31) 

< 7 + 1 — 7 = 1. 


Thus, the statements is true with 

1 = 7 + Pl{aL + 2(1 + d)(l - w min )). 


(32) 


□ 

Theorem 3.4 introduces an upper bound on the safeguard parameter p dif¬ 
ferent than the one considered in Theorem 3.2. The relation between the two 
bounds depends on the choice of 5 in Theorem 3.2. Taking a sufficiently small 
S in Theorem 3.2, we obtain that p in Theorem 3.2 is larger. However, taking 
S < aL+(i+8)(i-w min ) sufficiently close to aL+[1+ o)(i - Wmin ) , P in Theorem 3.4 
eventually becomes larger. 

One way to interpret the relation between Theorems 3.2 and 3.3 on one 
hand, and Theorem 3.4 on the other hand, as far as p is concerned, is as follows. 
Taking a very small <5, Theorem 3.3 allows for a quite large p but on the other 
hand it significantly decreases the admissible step size e. At the same time, 
Theorem 3.4 corresponds in a sense to an opposite situation where e is allowed 
to be quite large (in fact, equal to one), while p is quite restricted. Therefore, 
the two results exploit the allowed “degrees of freedom” in a different way. 

For the sake of completeness we list here the conditions for local convergence 
of Inexact Newton methods. 

Theorem 3.5. Assume that Al holds and that s k satisfies the inequality 

|| VH>(cc fc ) + S7 2 $(x k )s k || < t||V4>(a; fe )||, k = 0,1,... 

for some t < 1. Furthermore, assume that x k+1 = x k + s k , k = 0,1,.... Then 
there exists 77 > 0 such that for all ||a ; 0 — x*|| < 77, the sequence {aA} converges 
to x*. The convergence is linear, 

\\x k+1 - £*||* < t\\x k - 2 *||*, k = 0 , 1 , . . . , 
where ||y||* = ||V 2 $( 2 *)y||. 

The two previous theorems imply the following Corollary. 
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Corollary 3.2. Assume that the conditions of Theorem 3.4 hold. Then there 
exists 77 > 0 such that for every x° satisfying ||rr° — x*|| < 77 , the sequence {x A: } 
generated by Algorithm DQN and e = 1 converges linearly to x* and 


|V 2 $(x*)(x 


fc+i 


< f||V 2 $(x*)(x fc 


k = 0 , 1 , 


holds with t £ ( 0 , 1 ). 

For (strongly convex) quadratic functions fi, i = 1,..., n we can also claim 
global linear convergence as follows. 


Theorem 3.6. A ssume that all loss fun ctions fi are strongly convex quadratic 
and that the conditions of Theorem 3.4 are satisfied. Let {x k } be a sequence 
generated by Algorithm DQN with e = 1. Then lirn^oo x k = x* and 

< t ||x fc — x*||*, k = 0 , 1 ,... 


||x fe+ 1 -x* 


for t defined in Theorem \ 3.J\ 

Proof. Given that the penalty term in ([4]) is convex quadratic, if all local 
cost functions fi are strongly convex quadratic, then the objective function $ 
is also strongly convex quadratic, i.e., it can be written as 

d>(x) =-(x — x*) T B(x — x*), (33) 


for some fixed, symmetric positive definite matrix B £ ]g™pxrap. R eca n that x* 
is the global minimizer of <f>. Then 

V$(x) = B(x - x*) and V 2 $(x) = B. 


Starting from 

s k = -(V 2 $(x*)) _ 1 V$(x fc ) + e fe , 

we get 

|! V 2 $(x fc )e fc || = ||V 2 3'(x fe )(s fe + (V 2 $(x fc ))- 1 V$(x fc ))|| 

= ||V$(x fc ) + V 2 $(x fc )s fc || < f||V$(x fe )|| 

by Theorem |3.4| Next, 



x k+i = x k + s k = x k _ (v 2 $( a; fe ))- 1 V4'(x fc ) + 


= x k -B^ 1 V$(x fc ) + e fc , 

and 

x k+i ^ = x k _ x * _ B" 1 V$(x fc ) + e k . 

Therefore, 

B(x fc+1 - x*) = B(x fe - x*) - V$(x fe ) + Be fc . 

Now, 

||Be fc || = ||V 2 < f>(x fe )e fc || < f||V$(x fc )|| = f||B(x fc - x* 

and 

||B(x fe+1 - x*)|| = ||Be fe || < f||B(x fe - x*)||. 

□ 
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4 Variants of the general DQN 

Let us now discuss the possible alternatives for the choice of L/-. Subsection 
4.1 presents three different variants of the general DQN algorithm which mutu¬ 
ally differ in the choice of matrix Lfc. We refer to the three choices as DQN-0, 
DQN-1, and DQN-2. All results established in Section 3 hold for these three al¬ 
ternatives. Subsection 4.1 also provides local linear convergence rates for DQN-2 
without safeguarding. Subsection 4.2 gives a discussion on the algorithms’ tun¬ 
ing parameters, as well as on how the required global knowledge by all nodes 
can be acquired in a distributed way. 


4.1 Algorithms DQN-0, 1, and 2 


The analysis presented so far implies only that the diagonal matrix L^. has to 
be bounded. Let us now look closer at different possibilities for defining L*,, 
keeping the restrictions stated in Theorem |3.2| and |3.4| 

DQN-0. We first present the method DQN-0 which sets = 0. Clearly, for 
DQN-0, Theorems 3.2 and 3.4 hold, and thus we get linear convergence with the 
proper choice of e, and local linear convergence with e = 1. The approximation 
of the Hessian inverse in this case equals A^ 1 , i.e., the Hessian is approximated 
by its block diagonal part only. The method DQN-0 corresponds to Algorithm 1 
with only steps 1-4 and 7-9 executed, with A* 1 = 0 in step 7. Clearly, choice 
Lfc = 0 is the cheapest possibility among the choices of L* if we consider the 
computational cost per iteration k. The same holds for communication cost 
per k, as each node needs to transmit only x* per each iteration, i.e., one 
p-dimensional vector per node, per iteration is communicated. We note that 
DQN-0 resembles NN-0, but the difference in general is that DQN-0 uses a 
different splitting, parameterized with 6 > 0; actually, NN-0 represents the 
special case with 6 = 1. 

DQN-1. Algorithm DQN-1 corresponds to setting = L, k = 0,1,..., 
where L is a constant diagonal matrix. Assuming that L is chosen such that 


< p , with p specified in Theorem 3.2 global linear convergence for a proper 


step size e and local linear convergence for the full step size e = 1 again hold. 
Algorithm DQN-1 is given by Algorithm 1, where each node utilizes a constant, 
diagonal matrix A,;. There are several possible ways of choosing the Aj’s. In this 
paper, we focus on the following choice. In the first iteration k = 0, each node i 
sets matrix A° through algorithm DQN-2, stated in the sequel, and then it keeps 
the same matrix A^ 1 throughout the whole algorithm. The computational cost 
per iteration of DQN-1 is higher than the cost of DQN-0. At each iteration, each 
node i needs to compute the corresponding inverse of z-th block of A^ and then 
to multiply it by the constant diagonal matrix A*. Regarding the communication 
cost, each node transmits two p-dimensional vectors per iteration - xf and d* 
(except in the first iteration k = 0 when it also transmits an additional vector 
it°; see ahead Algorithm 3). Although the focus of this paper is on the diagonal 
Lfc’s, we remark that setting 0 = 1 and L= —A^7 1 recovers the NN-1 method. 

DQN-2. Algorithm DQN-2 corresponds to an iteration-varying, diagonal 


19 



matrix L*,. Ideally, one would like to choose matrix Lj, such that search direction 
s k resembles the Newton step as much as possible, with the restriction that h k 
is diagonal. The Newton direction s%- satisfies the equation 

V 2 $(x fc )s^ + V$(x fe ) = 0. (34) 


We seek h k such that it makes residual M{h k ) small, where M{h k ) is defined 
as follows: 

M(L fe ) = ||V 2 $(x fc )s fc + V4>(x fe )||. (35) 

Notice that 

M( L fe ) = ||V 2 $(x fe )s fe + V$(x fc )|| 

= || - V 2 $(x fe )(/ - L fc G)A- 1 V$(x fe ) + V$(x fc )|| 

= || - V 2 $(x fc )A j ^ 1 V$(x fc ) + V 2 $(x k )h k GA^ 1 V$(x fc ) + V$(x fe )|| 

= || - {A k - G)A" 1 V$(x fc ) + V 2 $(x fe )L fe GA^ 1 V$(x fe ) + V$(x fc )|| 
= ||GA^ 1 V$(x fe ) + V 2 $(x fc )L fc GA- 1 V4>(x fc )||. 

Therefore, 

V 2 4>(x fe )s fe + V$(x fc ) = u k + V 2 $(x fc )L k u k , (36) 

where 

u k = GA^ 1 V$(x fc ). 

The minimizer of M(h k ) is clearly achieved if L*, satisfies the equation 


L k u k = -(V 2 $(x fc )) 




(37) 


but (37) involves the inverse Hessian. Thus we approximate (V 2 <f>(x fc )) 1 by 
the Taylor expansion as follows. Clearly, 


(V 2 $(a fc )) _1 = (aV 2 F(x fe ) +1 - Z)" 1 = (I - V*)" 1 , V k =Z- aY 2 F{x k ). 

(38) 

Assume that a < (1 + X n )/L, with X n being the smallest eigenvalue of W. Then 


Similarly, 

Hence, 


Y k >i {X n — aL)I> -1. 

Vfc < (1 — a/x) I -< I. 


p(V fe ) < ||V fc || 2 < 1. 

Therefore, I — Y k is nonsingular, 


(I-Vfc)” 1 = I + V fc + ^V l fe , 

i =2 

and the approximation 

(V 2 $(® fc ))- X = (I — Vfc)- 1 «I + Vfc 


(39) 
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(40) 


is well defined. So, we can take L/ c which satisfies the following equation 

L k u k = -(I + V k )u k . 


Obviously, L*, can be computed in a distributed manner. We refer to the method 
which corresponds to this choice of L k as DQN-2. The algorithm is given by 
Algorithm 2 where step 6, the choice of L* = diag( Ai,..., A n ), involves the steps 
presented below in Algorithm 3. Denote by u k the i- th p x 1 block of u k - the 
block which corresponds to node i. 

Algorithm 3: Choosing Lfc with DQN-2 

6.1 Each node i calculates 

ieo< 

6.2 Each node i transmits u k to all its neighbors j £ Oi and receives u k from 
all j £ O l . 

6.3 Each node i calculates A^ - the solution to the following system of equa¬ 
tions (where the only unknown is the p x p diagonal matrix A k ): 

A i u k = - [ (1 + wa)I - a V 2 /,( x k ) ] u k - W H u j- 

j&Oi 


6.4 Each node i projects each diagonal entry of A^ onto the interval [—p, p\. 

Note that step 6 with algorithm DQN-2 requires an additional p-dimensional 
communication per each node, per each k (the communication of the u k, s.) 
Hence, overall, with algorithm DQN-2 each node transmits three p-dimensional 
vectors per k - x k , d k , and u\. 

We next show that algorithm DQN-2 exhibits local linear convergence even 
when safeguarding (Step 6.4 in Algorithm 3) is not used. 

Theorem 4.1. Suppose that A1-A3 hold and let x k he an arbitrary point such 
that V4‘(a: fc ) ^ 0. Assume that 


a < min 


1 T A„ Wmin ^ 
_ L~' 2L : L2 J ’ 


(41) 


and s k is generated by [11] and Algorithm 2, Steps 6.1 -6.3. 
t G (0,1) such that 


Then there exists 


||V$0r fc ) + V 2 $(a; fe )s fe || < t\[ VT>(m fe )||. 


Proof. Using (36) and (|40[) we 


obtain 


|| V 2 <h(a; fe )s fe + V$(a; fe )|| 


\\u k + V 2 <S>(x k )L k u k \\ 

\\u k -S/ 2 $(x k ){I + V k )u k \\ 

||(I - V 2 4>(a; fe )(I + Z - aV 2 F(x k )))u k |[ 
||P fc zt fe ||, (42) 
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where 


P fc = I-V 2 $(a: fc )(I + Z- a\7 2 F{x k )) 

= I — (I + aX/ 2 F(x k ) — Z) (l + Z — aV 2 F{x k )) 

= Z 2 - a{ZV 2 F(x k ) + V 2 F( x k )Z) + (aS7 2 F{x k )) 2 

Since ||V 2 /i(a;,;)|| < L , there follows ||V 2 -F(a; fc )|| < L and the previous equality 
implies 

||P fc || < ||Z 2 - a(Z V 2 F{x k ) + \7 2 F{x k ) Z)|| + a 2 L 2 := \\U k \\ + a 2 L 2 . (43) 

Now, 

n 

Uij = Yw ik w kj I - a(wijV 2 fj(Xj) + WijV 2 fi(Xi)). 

k =1 

Furthermore, the assumption a < w m i n /(2L) implies 


n 

Y WjkWkj > WiiWij > WijWmin > Wij2aL > Wij2afi. 

k= l 

Moreover, V 2 fj(x k ) F /i/ and 


n 

Wij\\2 < ^ ~2m k w kj - 

k =1 


Therefore, 


n n 

\\U k \\ < max Y(Yw ik w kj - 2a^ Wij ) 

1=1,...,n z —' z —' 

2—1 k=l 


max Y w kj Y w ik ~ 2a M Y. ' 

7 =l,...,n z —' 1 z —' z —' 

k=1 2—1 2=1 


= 1 — 2an. 


So, 


|!P fc || < h(a), 


(44) 


where h(a) = 1 — 2afj, + a 2 L 2 . This function is convex and nonnegative since 
H < L and therefore 


min 

a 




> o. 


Moreover, h{ 0) = h (|§) = 1 and we conclude that for all a € (0,2/x/T 2 ) there 
holds h(a) € (0,1). As 

u k = G A“ 1 V$(a; fe ), 
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we have 


(45) 


Now, 


Therefore, 


* x l 


= max(0(l — Wjj)\\(A k ) *|| + w ij\\( A j) X | 


igo. 


< max 


_ d ( l - ™jj) + 1 - w n 
T' an + (1 + 0)(1 - Wjj) 

(1 + 0)(1 - Wmi n ) 


an T (1 T 0)(1 U)rnin) 


< 1 . 


|rt fe || < ||V$(s* 


Putting together (42)-(|46j), for 0 > 0 and a satisfying (41) we obtain 
||V$(a; fc ) + V 2 $(x fe )s fc || < h(a)\\u k \\ < h(a)\\ V$(a; fe )||, 
i.e. the statement holds with 

t = h(a) = 1 — 2 an + a 2 L 2 


(46) 


(47) 


□ 

Applying Theorem |3.5| once again, we get the local linear convergence as 
stated in the following corollary. 

Corollary 4.1. Assume that the conditions of Theorem \4-l\ hold. Then there 
exists rj such that for every x° satisfying ||a: 0 — a;* || < rj the sequence {a; fe }, gen¬ 
erated by DQN-2 method with Steps 6.1-6.3 of Algorithm 3 and e = 1, converges 
linearly to x* and 

||V 2 $0r*)(:r fc+1 -®*)|| < t||V 2 $(a;*)(:r fc -x*)||, A: = 0,1,... (48) 


holds with t given by 


We remark that, for strongly convex quadratic ff s, the result analogous to 


Theorem 3.6 holds in the sense of global linear convergence, i.e., inequality (48) 
holds for all k and arbitrary initial point x°. 

Remark. ***An interesting future research direction is to adapt and analyze 
convergence of the DQN methods in asynchronous environments, as it has been 
already numerically studied recently in [TO]. Therein, it is shown that the stud¬ 
ied second order methods still converge in an asynchronous setting, though with 
a lower convergence speed. 
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4.2 Discussion on the tuning parameters 


Let us now comment on the choice of the involved parameters - matrix W 
and scalars a , p, e, 9 , and S. We first consider the general DQN method in 
Algorithm 1, i.e., our comments apply to all DQN-K variants, £ = 0,1, and 2. 

Matrix W only needs to satisfy that: 1) the underlying support network is 
connected and 2) all diagonal entries wu lie between w m in and w max , where 
0 < w m in < w max < 1- Regarding the latter condition, it is standard and 
rather mild; it is only required for 0 to hold, i.e., to ensure that solving 0 
gives an approximate solution to the desired problem 0- Regarding the second 
condition, it can be easily fulfilled through simple weight assignments, e.g., 
through the Metropolis weights choice; see, e.g., [551 . 

We now discuss the choice of the parameters a, p, 9, e, and S. First, a defines 
the penalty reformulation (4), and therefore, it determines the asymptotic error 
that the algorithm achieves. The smaller a, the smaller the limiting (saturation) 
error of the algorithm is, but the slower the convergence rate is. Thus, the pa¬ 
rameter should be set a priori according to a given target accuracy; see also [ID]. 
A practical guidance, as considered in m , is to set a = 1/ (K, L) , where L is the 
Lipschitz gradient constant as in the paper, and JC = 10-100. Next, parameter 
9 > 0 determines the splitting of the Hessian. It can simply be taken as 9 = 0, 
and a justification for this choice can be found in Section 5. Next, the role of 
S is mainly theoretical. Namely, Theorems 3.2 and 3.3 consider generic choices 
of Lfc’s, and they are worst-case type results. Therein, S essentially trades off 
the guaranteed worst-case (global linear) convergence factor with the size of 
admissible range of the L/-’s (size of the maximal allowed p ). As per (18), a 

-y. Having 


reasonable choice to balance the two effects is S = », j-rrrrBs 


set 6, the remaining two parameters, e and p, can be set according to (27) 
and (18), respectively. As noted, the described choice of the triple (S,p,e) is a 
consequence of the worst case, conservative analysis with Theorems 3.2 and 3.3 
(which still have a theoretical significance, though). In practice, we recommend 
setting 5 = 0, e = 1, and p as the upper bound in (18) with <5 = 0. 

Discussion on distributed implementation. The algorithm’s tuning 
parameters need to be set beforehand in a distributed way. Regarding weight 
matrix W, each node i needs to store beforehand the weights w lt and Wij, 


j £ Oi, for all its neighbors. The weights can be set according to the Metropolis 
rule, e.g., [39], where each node i needs to know only the degrees of its im¬ 
mediate neighbors. Such weight choice, as noted before, satisfies the imposed 
assumptions. 

In order to set the scalar tuning parameters a, 9 , e, and p, each node i 
needs to know beforehand global quantities w m im w max , p and L. Each of 
these parameters represent either a maximum or a minimum of nodes’ local 
quantities. For example, w ma x is the maximum of the wa’s over i = 1, ...,n, 
where node i holds quantity wu. Hence, each node can obtain w max by running 
a distributed algorithm for maximum computation beforehand; for example, 
nodes can utilize the algorithm in [34] . 
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5 Simulations 


This section shows numerical performance of the proposed methods on two 
examples, namely the strongly convex quadratic cost functions and the logistic 
loss functions. 

Simulation setup. Two simulation scenarios with different types of nodes’cost 
functions ft s: 1) strongly convex quadratic costs and 2) logistic (convex) loss 
functions are considered. Very similar scenarios have been considered in [22] [221 
ED- With the quadratic costs scenario, /) : R p —>• R is given by 

fi{x) = ^(x - a,i) T Bi(x - at), 

where Bi £ R pxp is a positive definite (symmetric matrix), and a* 6 R p is a 
vector. Matrices Bi, i = 1 are generated mutually independently, and so 
are the vectors a,’s; also, B t ' s are generated independently from the a»’s. Each 
matrix Bi is generated as follows. First, we generate a matrix Bi whose entries 
are drawn mutually independently from the standard normal distribution, and 
then we extract the eigenvector matrix Q £ W xp of matrix |(B + B T ). We 
finally set Bi = <5Diag(ci)Q T , where Cj £ R p has the entries generated mutually 
independently from the interval [1,101]. Each vector Gq £ R p has mutually inde¬ 
pendently generated entries from the interval [1,11]. Note that ai -the minimizer 
of fi ~is clearly known beforehand to node i. but the desired global minimizer of 
/ is not known by any node i. 

The logistic loss scenario corresponds to distributed learning of a linear clas¬ 
sifier; see, e.g., [lj for details. Each node i possesses J = 2 data samples 
{ a ij>bij}j- 1 - Here, a^- £ R 3 is a feature vector, and bij £ {—1, +1} is its class 
label. We want to learn a vector x = (x^Xo) 1- , X\ £ R p_1 , and x 0 £ R, p > 2, 
such that the total logistic loss with I 2 regularization is minimized: 

n J 

EE tf^logis ipij^X 1 CL -\- Xo)) ~h a~||a:|| , 

»=1 2=1 

Here, j7i og i s (-) is the logistic loss 

Jiogis(-) = log(l + e~ z ), 

and r is a positive regularization parameter. Note that, in this example, we 
have 

j 

fi(x') — ^ ] J^iogis {pij (.Xi CL T Xo)) T || x|| , 

2=1 H 

f(x) = Er=i fi ( x )• r ^^ ie a ij’ s are generated independently over i and j , where 
each entry of a^- is drawn independently from the standard normal distribu¬ 
tion. The “true” vector x * = ((x^) T ,Xo) T is obtained by drawing its entries 
independently from standard normal distribution. Then, the class labels are 
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bij = sign ((.t^) t dij + Xq + e^), where ey’s are drawn independently from nor¬ 
mal distribution with zero mean and standard deviation 0.1. 

The network instances are generated from the random geometric graph 
model: nodes are placed uniformly at random over a unit square, and the node 


pairs within distance r = y are connected with edges. All instances of 
networks used in the experiments are connected. The weight matrix W is set as 
follows. For a pair of nodes i and j connected with an edge, Wij = 2ma x{d d }+i > 
where di is the degree of the node i: for a pair of nodes not connected by an 
edge, we have Wij = 0; and wu = 1 — m b> f° r a -^ *• For the case of reg¬ 
ular graphs, considered in [221 [23 [ 21 ], this weight choice coincides with that 
in [22 [23 [24]. 


The proposed methods DQN are compared with the methods NN-0, NN-1, 
and NN-2 proposed in [21] • The methods NN-£, with i > 3 are not numerically 
tested in [22] and require a large communication cost per iteration. Recall 
that the method proposed in this paper are denoted DQN-£ with L& = 0 as 
DQN-0; it has the same communication cost per iteration k as NN-0, where 
each node transmits one (p-dimensional) vector per iteration. Similarly, DQN-1 
corresponds to NN-1, where two per-node vector communications are utilized, 
while DQN-2 corresponds to NN-2 (3 vector communications per node). 

With both the proposed methods and the methods in [22], the step size 6 = 1 
is used. Step size e = 1 has also been used in [22, 23, H]- Note that both 
classes of methods - NN and DQN - guarantee global convergence with e = 1 
for quadratic costs, while neither of the two groups of methods have guaranteed 
global convergence with logistic losses. For the proposed methods, safeguarding 
is not used with quadratic costs. With logistic costs, the safeguarding is not 
used with DQN-0 and 2 but it is used with DQN-1, which diverges without the 
safeguard on the logistic costs. The safeguard parameter p defined as the upper 
bound in (18) with 5 = 0 is employed. Further, with all DQNs, 6 = 0 is used. 
With all the algorithms, each node’s solution estimate is initialized by a zero 
vector. 

The following error metric 


1 

n 


E 1 


1c ~ic 

Xi — X 


2 


X * ^ o, 


is used and refered to as the relative error at iteration k. 

Figure [I] (left) plots the relative error versus the number of iterations k for 
a network with n = 30 nodes, and the quadratic costs with the variable dimen¬ 
sion p = 4. First, we can see that the proposed DQN-£ methods perform better 
than their corresponding counterparts NN-^, £ = 0,1, 2. Also, note that the per¬ 
formance of DQN-1 and DQN-2 in terms of iterations match in this example. 
Figure [l] (right) plots the relative error versus total number of communications. 
We can see that, for this example, DQN-0 is the most efficient among all meth¬ 
ods in terms of the communication cost. Further, interestingly, the performance 
of NN-0, NN-1, and NN-2 is practically the same in terms of communication 
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cost on this example. The clustering of the performance of NN-0, NN-1, and 
NN-2 (although not so pronounced as in our examples) emerges also in Simu¬ 
lations in [221 [231 EH- Also, the performance of DQN-0 and NN-1 practically 
matches. In summary, method DQN-0 shows the best performance in terms of 
communication cost on this example, while DQN-1 and 2 are the best in terms 
of the number of iterations k. 

The improvements of DQN over NN are mainly due to the different splitting 
parameter 0 = 0. Actually, our numerical experience suggests that NN-f? may 
perform better than DQN-£ with 0 = 1 for £ = 1,2. We provide an intuitive 
explanation for the advantages of choice 0 = 0, focusing on the comparison 
between NN-0 and DQN-0. Namely, the adopted descent directions with both 
of these methods correspond to the quality of the zeroth order Taylor expansion 
of the following matrix: 


(l — (Afc(0)) -1 G) 1 ss I. 


(49) 


In |49| ), with NN-0, we have A*,(0) = A*(0 = 1), while with DQN-0, we have 
that: Afc(0) = A*,(0 = 0); note that these two matrices are different. Now, the 
error (remainder) of the Taylor approximation is roughly of size ||(Afc(0))~ 1 Gf||. 
In view of the upper bound in (301, we have with NN-0 that the remainder is 
of size: 

,_i ii 1 a p 


ll(A fc (l))- 1 G|| 


1 - 


2(1 Wmin) 

for small a /i. On the other hand, we have that the DQN-O’s remainder is: 


(50) 


||(Afc(0)) -i <G| 


1 - 


a p 


1 -Wn 


Therefore, the remainder is (observed through this rough, but indicative, es¬ 
timate) larger with NN-0, and that is why DQN-0 performs better. We can 
similarly compare NN-1 (which corresponds to the first order Taylor approxi¬ 
mation of the matrix in (49)) with DQN-0. The remainder with NN-1 is roughly 
||(Afc(l)) _1 <G|| 2 ~ 1 — , which equals to the remainder estimate of DQN- 

0. This explains why the two meth ods perform very similarly. Finally, note 
that the upper bound on ||<GA)7 1 || in (301 is an increasing function of 0 > 0 (the 


lower the 0, the better the bound), which justifies the choice 0 = 0 adopted here 
for DQN. 

Figure [2] (left and right) repeats the plots for the network with n = 400 
nodes, quadratic costs, and the variable dimension p = 3. One can see that 
again the proposed methods outperform their respective NN-£ counterparts. In 
terms of communication cost, DQN-0 and DQN-1 perform practically the same 
and are the most efficient among all methods. 

Figure [2] plots the relative error versus number of iterations (left) and num¬ 
ber of per-node communications (right) for the logistic losses with variable di¬ 
mension p = 4 and the network with n = 30 nodes. One can see that again 
the proposed methods perform better than the NN-£ counterparts. In terms of 
the communication cost, DQN-0 is the most efficient among all methods, while 
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Figure 1: Relative error versus number of iterations k (left) and versus number 
of communications (right) for quadratic costs and n = 30-node network. 
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Figure 2: Relative error versus number of iterations k (left) and versus number 
of communications (right) for quadratic costs and n = 400-node network. 
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DQN-2 is fastest in terms of the number of iterations. Finally, Figure [4] repeats 
the plots for variable dimension p = 4 and the network with n = 200 nodes, 
and it shows similar conclusions: among all DQN and NN methods, DQN-0 is 
the most efficient in terms of communications, while DQN-2 is fastest in terms 
of the number of iterations. 




Figure 3: Relative error versus number of iterations k (left) and versus number 
of communications (right) for logistic costs and n = 30-node network. 


6 Extensions 

As noted before, DQN methods do not converge to the exact solution of (1) 
but to a solution neighborhood controlled by the step size a. As such, for 
high solution accuracies required, they may not be competitive with distributed 
second order methods which converge to the exact solution [28l [38] [27]. 

However, we can exploit the results of [25] and “embed” the DQN algorithms 
in the framework of proximal multiplier methods (PMMs), just like [25J embeds 
the NN methods into the PMM framework. We refer to the resulting algorithms 
as PMM-DQN-£, £ = 0,1,2. (Here, PMM-DQN-£ parallels DQN-£ in terms of 
complexity of approximating Hessians, and in terms of the communication cost 
per iteration.) It is worth noting that the contribution to embed distributed sec¬ 
ond order methods into the PMM framework is due [25] . Here we extend [25] to 
demonstrate (by simulation) that the DQN-type Hessian approximations within 
the PMM framework yield efficient distributed second order methods. 

We now briefly describe a general PMM-DQN method; for the methodology 
to devise distributed second order PMM methods, we refer to [28] . See also the 
Appendix for further details. We denote by x k = (x k ,..., x k ) £ R rap the current 
iterate, where x k £ is node i’s estimate of the solution to (1) at iteration k. 
Besides x k (the primal variable), the PMM-DQN method also maintains a dual 
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number of per-node communications 


Figure 4: Relative error versus number of iterations k (left) and versus number 
of communications (right) for logistic costs and n = 200-node network. 


variable q k = (qf,...,q k ) el" p , where q k £ is node i’s dual variable at 
iteration k. Quantities H/., A&, G and 'gu defined below play a role in the PMM- 
DQN method and are, respectively, the counterparts of V 2 d>(a; fe ), A*,, G and 
V$(x fe ) with DQN: 



= \7 2 F(x k )+p (I-Z)+e pmm I = A fc -G 

(51) 


= S7~F(x k ) + P (I — Zd) + e pmm I + P 9 (I — Z^) 

(52) 

G 

= P Z u + p 9 (I — Zd) 

(53) 

9k 

= X7F(x k )+ p(l-Z)x k +q k . 

(54) 


Here, 9 > 0 is the splitting parameter as with DQN, /3 > 0 is the dual step size 
and e pmm > 0 relates to the proximal term of the corresponding augmented La- 
grangian; see [2S] for details. We now present the general PMM-DQN algorithm. 
Note from Step 2 the analogous form of the Hessian inverse approximation as 
with DQN; the approximation is again parameterized with a (np) x ( np ) diagonal 
matrix L^. 

Algorithm 4: PMM-DQN in vector format 

Given x° = 0, p, e pmm , p > 0, 6 > 0. Set k = 0. 

Step 1. Chose a diagonal matrix L*, £ jgyipxnp suc h that 

l|Lfc|| < p- 


Step 2. Set 


s fc = -(I — LfcG)A fc 1 gk- 
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Step 3. Set 


x k+1 = x k 


Step 4. Set 

q k+1 =q k + (I — Z)x k+1 ; k = k + 1. 

The same algorithm is presented below from the distributed implementation 
perspective. (Note that here we adopt the notation similar to DQN, i.e., the 
(i, j)-th px p block of G is denoted by Gif the *-th p x p diagonal block of L fc 
is denoted by A k ; and i-th px p diagonal block of is denoted by A k .) 


Algorithm 5: PMM-DQN Distributed implementation 

At each node i, require (3, p , e pmm > 0, 9 > 0. 

1 Initialization: Each node i sets k = 0 and X? = q? = 0. 

2 Each node i calculates 


d k = 


(%) 


-1 


+ P Wi i ~ Xj) + Qi 

j&Oi 


3 Each node i transmits d k to all its neighbors j £ Oi and receives d k from 
all j £ Oi. 

4 Each node i chooses a diagonal px p matrix A k , such that ||A^|| < p. 

6 Each node i calculates: 

s^-df + Af^Gijdf. 

j&Oi 

6 Each node i updates its solution estimate as: 

x k+1 = x k +s k . 

7 Each node i transmits x k+1 to all its neighbors j £ Oi and receives x k+1 
from all j £ Oi. 

8 Each node i updates the dual variable q k as follows: 

r 1 =^+e (sr 1 - x k+i ) ■ 

j&Oi 


9 Set k = k + 1 and go to Step 2. 
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Note that, at Step 2, each node i needs the neighbors’ estimates x k , j £ O;. 
For k > 1, the availability of such information is ensured through Step 7 of the 
previous iteration k — 1; at k = 0, Step 2 is also realizable as it is assumed that 
= 0, for all i = 1, ...,n. As with the DQN methods, different variants PMM- 
DQN-f mutually differ in the choice of matrix L*,. With PMM-DQN-0, we set 
Lfc = 0; with PMM-DQN-2, L*, is set as described below; with PMM-DQN-1, 
we set Lfc = L = const to Lo, i.e., to the value of from the first iteration of 
PMM-DQN-2. 

We now detail how L^, is chosen with PMM-DQN-2. The methodology is 
completely analogous to DQN-2: the idea is to approximate the exact Newton 
direction which obeys the following Newton-type equation: 


Hfc s + gk — 0, 


(55) 


through Taylor expansions. Using (55) and completely analogous steps as with 
DQN-2, it follows that L& is obtained through solving the following system of 
linear equations with respect to L&: 


Uu=~ 


ft- 


-1 + 


{ft- 




{ft + £pmm ) 2 


V 2 F{x k ) u 


where u k =GA, 1 g k . 


(56) 


The overall algorithm for setting L*. with PMM-DQN-2 (step 4 of Algorithm 5) 
is presented below. 


Algorithm 6: Computing La, with PMM-DQN-2 

4.1 Each node i calculates 


u k =Y J G^ dk . 

j&Oi 


4.2 Each node i transmits u k to all its neighbors j £ Oi and receives u k from 
all j £ Oi . 

4.3 Each node i calculates A k - the solution to the following system of equa¬ 
tions (where the only unknown is the p x p diagonal matrix A^): 


A iut = - 


Pwu 


ft + e F 


)I~ 


1 


{ft + £pmm ) 2 {ft + tpmm) 


2 Mx k ) 


X u k - 


ft 


{ft + £pmm ) 2 


E 

jeo. 


Wij u k . 


4.4 Each node i projects each diagonal entry of A k onto the interval [—p, p]. 

Simulations. We now compare by simulation the PMM-DQN-£ methods 
with the ESOM-£ algorithms proposed in (28], the DQM algorithm in |27], and 
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different variants of the Newton Raphson Consensus (NRC) algorithm proposed 
in 381 . 

The simulation setup is as follows. The network is an instance of the 
random geometric graph with n = 30 nodes and 166 links. The optimiza¬ 
tion variable dimension is p = 4, and the local nodes costs are strongly con¬ 
vex quadratic, generated at random in the same way as with the quadratic 
costs examples in Section 5. With all methods which involve weighted av¬ 
eraging (ESOM-f?, PMM-DQN-Q and NRC), we use the Metropolis weights. 
We consider the ESOM-f and PMM-DQN-£ methods for £ = 0,1,2. With the 
methods ESOM-£ and PMM-DQN-£, we set the proximal constant e pmm = 10 
(see |2S] for details). Further, we tune the dual step size f3 separately for each 
of these methods to the best by considering the following candidate values: 
/3 € {10 —4 ,10 -3 - 5 ,10 -3 ,..., 10 3 - 5 ,10 4 }, i.e., a grid of points equidistant on the 
log 10 scale with the half-decade spacing. The algorithm DQM has the tuning 
step size parameter c > 0 (see |27] for details) which we also tune to the best us¬ 
ing the same grid of candidate values. Regrading the methods proposed in [58], 
we consider both the standard (NRC) and the accelerated (FNRC) algorithm 
variant. These methods have a communication cost per node, per iteration 
which is quadratic in p , due to exchanging local Hessian estimates. (Specifi¬ 
cally, as it is sufficient to transmit the upper-triangular part of a local Hessian 
(due to the matrix symmetry), each node per iteration transmits p x (p + l)/2 
scalars for the Hessian exchange.) This is different from the ESOM, DQM, and 
PMM-DQN methods which have a cost linear in p. We also consider the Jacobi 
variants in [38] (both the standard - JC and the accelerated - FJC variant) 
which approximate local Hessians through diagonal matrices and hence their 
communication cost per node, per iteration reduces to a cost linear in p. With 
NRC, FNRC, JC, and FJC, we set their step size e to unity (see [58] for de¬ 
tails), as this (maximal possible) step size value yielded fastest convergence. 
We observed that JC and FJC converged with a non-zero limiting error, while 
decreasing the value of e did not improve the limiting error of the method while 
slowing down the methods. Hence, with all the methods considered, we tune 
their step sizes to the best (up to the finite candidate grid points resolution). 
With FNRC and FJC, we set the acceleration parameter (f> in the same way 
as in [58 (see page 10 of 08].) With all PMM-DQN methods, we do not use 
safeguarding ( p = + 00 ), and we set 0 = 0. With all the methods considered, the 
primal variables - solution estimates (and dual variables, if exist) are initialized 
with zero vectors. The error metric is the same as with the quadratic example 
in Section 5. 

Figure 5 compares the PMM-DQN-^ algorithms and the ESOM-£ algorithms 
in [28] in terms of the number of iterations (left) and the number of per-node 
communications (right). We can see that, in terms of iterations, for each fixed £, 
the corresponding PMM-DQN method performs better than the corresponding 
ESOM method. The exception is the case £ = 0 where the two methods are 
comparable. The same conclusions hold for the number of communications also. 
Further, in terms of the number of iterations, PMM-DQN-1 and PMM-DQN-2 
are the best among all methods; in terms of communications, PMM-DQN-1 is 
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the best method among all PMM-DQN and ESOM method considered. 

In Figure 6, we compare the best among the PMM-DQN-t' methods, the best 
among the ESOM-^ methods (in terms of iterations, the best are PMM-DQN-1 
and ESOM-2, while in terms of communications, the best are PMM-DQN-1 and 
ESOM-O), DQM, and the NRC methods group (NRC, FNRC, JC, and FJC). We 
can see that the PMM-DQN-1 method converges faster than all other methods, 
both in terms of iterations and communications. 




Figure 5: Comparison between the PMM-DQN-£ algorithms and the ESOM-f' 
algorithms in (2&j . The figures plot relative error versus number of iterations k 
(left) and versus number of communications (right) for quadratic costs and 
n = 30-node network. 


7 Conclusions 

The problem under consideration is defined by an aggregate, network-wide sum 
cost function across nodes in a connected network. It is assumed that the cost 
functions are convex and differentiable, while the network is characterized by 
a symmetric, stochastic matrix W that fulfils standard assumptions. The pro¬ 
posed methods are designed by exploiting a penalty reformulation of the original 
problem and rely heavily on the sparsity structure of the Hessian. The general 
method is tailored as a Newton-like method, taking the block diagonal part of 
the Hessian as an approximate Hessian and then correcting this approximation 
by a diagonal matrix L*,. The key point in the proposed class of methods is to 
exploit the structure of Hessian and replace the dense part of the inverse Hessian 
by an inexpensive linear approximation, determined by matrix L/ c . Depending 
on the choice of L^, one can define different methods, and three of such choices 
are analyzed in this work. An important characteristic of the whole class of 
DQN methods is global linear convergence with a proper choice of the step size. 
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Figure 6: Comparison between the DQM algorithm in EH, the NRC algorithm 
ho [38], the best among the PMM-DQN-£ algorithms, and the best among the 
ESOM-t* algorithms. The Figures plot relative error versus number of itera¬ 
tions k (left) and versus number of communications (right) for quadratic costs 
and n = 30-node network. 


Furthermore, we have shown local linear convergence for the full step size using 
the convergence theory of Inexact Newton methods as well as global conver¬ 
gence with the full step size for the special case of strictly convex quadratic loss 
functions. 

The three specific methods are analyzed in detail, termed DQN-0, DQN-1 
and DQN-2. They are defined by the three different choices of matrix L*, - the 
zero matrix, a constant matrix, and the iteration-varying matrix that defines a 
search direction which mimics the Newton direction as much as possible under 
the imposed restrictions of inexpensive distributed implementation. For the last 
choice of the time varying matrix, we have shown local linear convergence for 
the full step size without safeguarding. 

The cost in terms of computational effort and communication of these three 
methods correspond to the costs of the state-of-the-art Network Newton meth¬ 
ods, NN-0, NN-1 and NN-2, which are used as the benchmark class in this paper. 
The simulation results on two relevant problems, the quadratic loss and the lo¬ 
gistic loss, demonstrate the efficiency of the proposed methods and compare 
favorably with the benchmark methods. Finally, applying the recent contri¬ 
butions of [2H], the proposed distributed second order methods were extended 
to the framework of proximal multiplier methods. Unlike DQN, the modified 
methods converge to the exact solution and further enhance the performance 
when high solution accuracies are required. 
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Appendix 

Following [28], we briefly explain how we derive the PMM-DQN methods. The 
starting point for PMM-DQN is the quadratically approximated PMM method, 
which takes the following form (see [28] for details and derivations): 


x k+ 1 = x k 


(X7F{x k )+q k +l3{ I~Z)x k ) 


sfc+i 


= q k + /3{ H-Z)s 


fc + 1 


(57) 

(58) 


Here, /3 > 0 is the (dual) step size, is given in (51), and x k £ R” p and 
q k £ R np are respectively the primal and dual variables at iteration k = 0,1,..., 
initialized by x° = q° = 0. 


The challenge for distributed implementation of (571 (58) is that the inverse 


of Hfc does not respect the sparsity pattern of the network. The ESOM methods, 
proposed in |28j , approximate the inverse of Hj, following the NN-type approx¬ 
imations |22| . Here, we extend such possibilities and approximate the inverse 


of Hfc through the DQN-type approximations. This is defined in (51)-(531 and 
Algorithm PMM-DQN in Section 6. 

As noted, the matrix L/, = 0 with PMM-DQN-0, and it is L& = Lo = const 
with PMM-DQN-1, where Lo is the matrix from the first iteration of the DQN-2 
method. It remains to derive L& for PMM-DQN-2, as it is given in Section 6. 
As noted in Section 6, we approximate the Newton equation in (55). The 
derivation steps are the same as with DQN-2, with the following identification: 
V 2 <b(:r fe ) in (34) is replaced with H*, in (55), and V<l>(a; fc ) with fj k in (55). 


Then, equation (37) with DQN-2 transforms into the following equation with 
PMM-DQN-2: 


Lfc u ' = —I 


(59) 


where u k is given in (56). Finally, it remains to approximate 1 through a 


first order Taylor approximation, as follows: 




(A 


) i- 


P + £ p 


A 


A 


-z- 


c pmm 


-z- 


-V 2 F(x k ) 


c pmm 

-V 2 F(x k ) 
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The above Taylor approximation is well defined if the spectral radius of matrix 
—Z — — V 2 F(x k )^j is strictly less than one. It is easy to verify 

that this will be the case if the (positive) parameters /3 and e pmm satisfy /? > 

2 max{ 0 5 L Cpmm } • 
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